Abstract

Accurate semantic labeling of image pixels is difficult because
intra-class variability is often greater than inter-class
variability. In turn, fast semantic segmentation is hard because
accurate models are usually too complicated to also
run quickly at test-time. Our experience with building and
running semantic segmentation systems has also shown a
reasonably obvious bottleneck on model complexity, imposed
by small training datasets. We therefore propose
two simple complementary strategies that leverage context
to give better semantic segmentation, while scaling up or
down to train on different-sized datasets.

As easy modifications for existing semantic segmentation
algorithms, we introduce Decorrelated Semantic Texton
Forests, and the Context Sensitive Image Level Prior.
The proposed modifications are tested using a Semantic
Texton Forest (STF) system, and the modifications are validated
on two standard benchmark datasets, MSRC-21 and
PascalVOC-2010. In Python based comparisons, our system
is insignificantly slower than STF at test-time, yet produces
superior semantic segmentations overall, with just
push-button training.

1. Introduction 

For many applications, such as navigation or robot-interaction,
semantic segmentation of images needs to be
both accurate and fast to be worthwhile. The environment
can change more or less abruptly, but typically, many
frames will have combinations of the same frequently
co-occurring classes. We leverage this persistence of context
to improve pixel classification accuracy, given finite quantities
of training data.

We build on the successful Semantic Texton Forest
(STF) [22] approach, and enhance it through two main
contributions. First, the Decorrelated Semantic Texton Forest
(DSTF) is proposed as a variant of the STF that essentially
preserves the original's efficiency. The DSTF uses hierarchical
clustering to decorrelate classes that have confusingly
similar appearance for an STF. We further improve
accuracy by introducing a Context Sensitive Image Level
Prior (Context Sensitive ILP). Training this multi-label prior
to account for the co-occurrence of classes proves to be very
helpful and substantially better than the more typical
multi-class training of ILPs.

2. Related Work 

Many semantic segmentation algorithms require carefully
tuned models and/or fully connected Conditional
Random Fields (CRFs) to produce accurate per-pixel 
labelings. Here, we review the most relevant such methods, 
as well as algorithms similar to ours that are close to 
state-of-the-art, but sacrifice some accuracy for improved 
test-time efficiency. 

General Algorithms 

We base our approach on the basic STF because we wish
to leverage its low computational complexity, allowing for
very fast implementations if needed. Shotton et al. introduced
STF as a component of their Bag of Semantic Textons
(BoST) [22] model. BoST is one of the earliest algorithms
to still be competitive in semantic segmentation challenges,
appearing on the leaderboards of many semantic segmentation
papers ID HU El. BoST still outperforms other state of
the art algorithms in some categories, as shown in Table 4,
despite running in real-time. The two main components of
the BoST model were i) use of the newly introduced STF,
and ii) application of the Textonboost ll^ approach for
encoding local context information, generating the BoST for
each patch. We henceforth refer to the first component as
STF, and to their combined approach as BoST. The STF
is trained by growing extremely randomized trees with raw
pixel image patches as features. Leaf nodes store the class
distributions of the image patches that reached that node.
Although the whole STF process is considered very efficient
at inference time, the STF by itself produces fairly low
quality results, because raw pixel patches are often not
expressive enough to be discriminative between classes. BoST
improves the results dramatically, but with some computational
overhead. Our proposed system modifies the STF by
adding only a little overhead, but achieves significantly better
performance compared with BoST [22].

One approach known for using an image level prior is
Q. Their overall system has a chain of stages: i) extracting
patches and their low-level features, ii) constructing
high-level features (Fisher vectors), iii) training (predicting, at
test time) a class scoring unit using the high-level features,
iv) assigning scores to a pixel and propagating scores to
oversegmented regions, and finally v) integrating with their
image level prior to refine their labels. In comparison,
we skip (i) and (ii), which are the bottleneck of that algorithm,
and use the DSTF, which is very efficient because no feature
extraction is required. Their image level prior also does not
exploit co-occurrence statistics but models only the presence
of each class individually. By combining three simple components:
local appearance scoring, context sensitive ILP, and location
potential, we show that our method is simpler and
performs better than 121 on the MSRC-21 dataset, 77% to
65% on average recall.

Another system that proposes a simple architecture is
ifTOll, which devised a multi-scale Convolutional Neural Network
(CNN) to extract features of a pixel for the scene labeling
task. Their multi-scale CNN is designed to capture different
levels of information, ranging from small region appearance
and neighborhood context up to the global context
of the image. The system remains simple by having
only two components: pixel classification and simple post
processing to smooth the classification result. However, the
system required a specially designed model and careful parameter
tuning at training time to get results comparable to
the state of the art algorithms. It is therefore hard to re-build
the training step and to test on different datasets. They also
made a version for RGB-D data from indoor scenes E). In
our approach, although we use CNN image level feature descriptors,
we picked a general purpose feature generator Q
that can be used out of the box without any parameter tuning.
At training time, our system needs very little parameter
tuning to achieve good results on different datasets.

CRF based Algorithms 

CRFs are used in many semantic segmentation algorithms
to regularize output labels. The Hierarchical Conditional
Random Field (HCRF) iflfil uses different levels of
quantization, from pixels to segments. They operate under
the assumption that there is unlikely to be a single optimal
quantization level that is suitable for all object categories.

Beyond regularizing just neighboring pixels, several
works model the relationships between all pixels. In the
Dense CRF semantic segmentation of [15], mean field
approximation and Gaussian filtering are used to make
inference in fully connected models practicable. Further,
m demonstrates that using a Dense CRF to infer all test
images at once gives better results than inferring one
image at a time. Our approach shows that a 4-connected
neighbor CRF model can achieve results comparable to the
fully connected model (both were tested on the MSRC-21
dataset).

More Sophisticated Algorithms 

Co-occurrence statistics have been exploited in semantic
segmentation systems to boost accuracy. The HCRF was
improved further in El by incorporating a co-occurrence
potential, as per-image context information, into the CRF
energy function. We propose a simpler system that also
uses the co-occurrence statistics, but incorporates them in
a different manner. Our system achieves average recall scores
comparable to El, without tuning our parameters
per-dataset.

Gonfaus et al. El proposed another improvement to the
HCRF, by adding a new consistency potential to the model,
called the harmony potential. The harmony potential encodes
all possible combinations of labels, allowing regions
to have more than one class, which was a perceived limitation
of HCRF models. Further, in Q, they introduced
three more cues into the local unary potential to improve
recall scores over their previous version. While certainly
worthwhile, these algorithms, e.g. miniiigiEi, achieve
ever better results by adding complexity to their models.
Our system maintains a very plain model, using a simple
4-connected CRF with Potts pairwise potentials to encourage
harmonization of the neighboring pixels. Our proposed
simple system outperforms El on both average and global
recall scores.

A sophisticated model was successfully demonstrated in
lEl, where the problems of semantic segmentation, object
detection, and scene classification were cast as one holistic
CRF model. Their parameters are learned via a structured
learning algorithm, and inference is accomplished by a convergent
message-passing algorithm. The model exploits
various cues, such as scene type, co-occurrence statistics,
the shape and location of the object, and different quantization
levels to boost the segmentation result. In contrast, our
proposed system exploits some of these important cues as
context, but integrates them together with a much simpler
model, achieving accuracy that approaches that of the more
sophisticated model.

Most recently, CNNs have also been exploited in a more
sophisticated semantic segmentation framework El. They
compute feature vectors for each proposed region using two
CNNs, trained especially on bounding boxes and free-form
versions of the region. Thereafter, the concatenated feature
vectors are passed through a linear SVM classifier to get a
per-class potential. The final class label for each region is
assigned via non-maximum suppression. Although the system
performed very well on the PascalVOC 2012 dataset, it
did so at the expense of algorithm complexity.

3. Cheap Semantic Segmentation Model

A good semantic segmentation algorithm should exploit
different levels of information: local appearance, global
appearance, the context of the scene, and location statistics of
objects. We leverage this information with an emphasis on
simplicity, so that both large and small training sets can be
exploited, and for efficiency at test-time.

3.1. Local Appearance and a Classifier 

The cornerstone of visual understanding is having a
class-covariant local appearance representation and a compatible
classification model. Although superpixels and
region-based methods encode neighborhood information,
we opt to work on individual raw pixels, curtailing the need
to select superpixel algorithms and parameters per dataset.

The Shotton et al. STF [22] is one of the simplest useful
local appearance classifiers because it processes raw pixel
values of a small image patch without constructing a separate
feature descriptor. Shotton et al. [22] also propose
BoST, working in concert with an STF, as a significant
contribution, because an STF has limited expressiveness
and moderate classification accuracy in itself. The price of
BoST's improved accuracy is its significant computational
cost, so we proceed with just the STF model and representation.

3.1.1 Learning from Confusion 

The first proposed contribution of this paper is to introduce
the Decorrelated Semantic Texton Forest (DSTF), which is
an improved version of the STF with only slight additional
computational cost at test time. The DSTF emerged from
our observation that very similar appearance patches can
reach the same leaf node in an STF tree at training-time,
even when they have different class labels. This problem
occurs in a significant minority of cases. Therefore, the
DSTF is designed specifically to reduce the incidence of
such high-entropy leaf-nodes.

The DSTF assumes that such "confused" leaf nodes are
populated with patches from distinct scene types, or categories.
To distinguish them, we add an upstream classifier
to infer a scene category* for each input image. The inferred
scene category dictates which single specialist STF should
process that image. We train a set of separate STFs, one for
each scene category.

The scene categories are determined automatically, after
growing a single temporary STF, depicted at the top of
Figure 1. The choice of categories we seek aims to group
patches whose visual appearance does not confuse a single
STF, and split visually similar patches from different classes
that do confuse it.

*The terms ‘scene category’, ‘scene,’ and ‘cluster’ all refer to the same 
concept when we are explaining our algorithm. 



Figure 1. Flow chart for training an upstream scene-category classifier
(the Cluster Recognizer), and the downstream Decorrelated
Semantic Texton Forest (DSTF) composed of multiple STFs, with
one STF per scene category.


Clustering Classes The temporary single STF is trained
with all the images and classes. A label is associated with
each pixel, which serves as the center of each small patch.
In practice, we re-implement STFs of the original paper [23]
and use the same parameter values, but without including
their training invariance, because those parameters were
not specified. Our early experiments showed that including
some training invariance only improves the STF marginally.

Next, a class correlation matrix $\Omega$ is calculated by treating
the class distributions at the leaf nodes of the trained
STF as observations. Let $X = \{X_1, \ldots, X_T\}$ be the set of
class distributions at the $T$ leaf nodes of the entire trained
STF. $X_i = \{P(c_1 \mid \mathcal{T}_i), \ldots, P(c_C \mid \mathcal{T}_i)\}$ is a $C$-dimensional
column vector of class probability at the leaf node $i$, conditioned
on training examples $\mathcal{T}_i$ that reached node $i$. The
entries of class correlation matrix $\Omega$ are

$$\Omega(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Cov}(x, x) \cdot \mathrm{Cov}(y, y)}}, \tag{1}$$

where $\mathrm{Cov}(x, y)$ is the covariance between class $x$ and $y$
observed from the data $X$ as row-slices across $X$. We then
cluster the semantic classes of the original problem by their
correlation values, where the distance function is defined by

$$\mathrm{Dist}(x, y) = \Omega(x, y) - \min(\Omega), \tag{2}$$

and $\mathrm{Dist}(x, x) = 0$. We subtract $\min(\Omega)$ from $\Omega(x, y)$ to
make the smallest distance equal to zero. We group semantic
classes by hierarchical clustering. To commit to hard
cluster boundaries, we choose the minimum intra-cluster
distance that forces every cluster to have at least three
cluster members (classes), to prevent generating trivially
small clusters.
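
The correlation and distance computations of Equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `class_correlation` and `class_distance`, the list-of-lists layout of the leaf histograms, and the handling of zero-variance classes are assumptions, not details taken from our implementation. The resulting distance matrix could then be passed to any agglomerative clustering routine.

```python
import math

def class_correlation(leaf_dists):
    """Eq. (1): correlation between classes, using leaf histograms as observations.

    leaf_dists is a list of T leaf-node histograms, each a list of C class
    probabilities (hypothetical layout chosen for this sketch).
    """
    T, C = len(leaf_dists), len(leaf_dists[0])
    means = [sum(row[c] for row in leaf_dists) / T for c in range(C)]

    def cov(a, b):
        return sum((row[a] - means[a]) * (row[b] - means[b]) for row in leaf_dists) / T

    omega = [[0.0] * C for _ in range(C)]
    for a in range(C):
        for b in range(C):
            denom = math.sqrt(cov(a, a) * cov(b, b))
            # Assumption: a class with zero variance is treated as uncorrelated.
            omega[a][b] = cov(a, b) / denom if denom > 0 else 0.0
    return omega

def class_distance(omega):
    """Eq. (2): shift correlations so the smallest is zero; Dist(x, x) = 0."""
    lowest = min(min(row) for row in omega)
    C = len(omega)
    return [[0.0 if a == b else omega[a][b] - lowest for b in range(C)]
            for a in range(C)]

# Toy example: classes 0 and 2 never share a leaf, classes 0 and 1 often do.
leaves = [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0], [0.1, 0.1, 0.8], [0.0, 0.2, 0.8]]
omega = class_correlation(leaves)
dist = class_distance(omega)
```

Note that, under Equation (2), strongly correlated (easily confused) classes end up far apart, so they tend to fall into different clusters.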

Gathering Images We use the class-clustering to gather the
training images into new decorrelated (or less-correlated)
training sets. We opt to gather images instead of patches to
avoid overfitting. Gathering all images that contain class c
would risk piling almost all the training images into some
"specialist" STFs, if a class is prevalent throughout, e.g.
sky. From experiments on the validation set of our smallest
dataset, MSRC-21, we found it already effective to use no more
than the top-7% of all training images when training one of
the cluster level STFs.

The procedure to rank the images for each cluster follows.
First, the class co-occurrence matrix $\Psi$ is computed
from the ground truth images. Each element of $\Psi$ represents
the probability of observing a class $y$ given an image
of class $x$, $\Psi(x, y) = P(y \mid I_x)$. Based on the matrix $\Psi$, we
rank instances (images) for each class $c$ in the cluster separately,
by assigning each image the score

$$S(c, G) = \sum_{i \in G} \Psi(c, \mathcal{L}(i)), \tag{3}$$

where $i$ is a pixel in a ground truth image $G$, and $\mathcal{L}(\cdot)$
returns the ground truth label for the input pixel.
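
As an illustration of the ranking step, the sketch below estimates the co-occurrence matrix from ground-truth label images, scores each image with Equation (3), and keeps a top fraction of the ranking. The function names, the nested-list image format, and the choice to estimate $P(y \mid I_x)$ from image-level presence counts are assumptions made for this example.

```python
def cooccurrence(label_images, num_classes):
    """Psi[x][y]: fraction of images containing class x that also contain class y."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for img in label_images:
        present = {lab for row in img for lab in row}
        for x in present:
            for y in present:
                counts[x][y] += 1
    psi = [[0.0] * num_classes for _ in range(num_classes)]
    for x in range(num_classes):
        if counts[x][x]:  # counts[x][x] is the number of images containing x
            psi[x] = [counts[x][y] / counts[x][x] for y in range(num_classes)]
    return psi

def image_score(c, img, psi):
    """Eq. (3): sum Psi(c, label) over every pixel of a ground-truth image."""
    return sum(psi[c][lab] for row in img for lab in row)

def top_images(c, label_images, psi, fraction=0.07):
    """Indices of the best-scoring images for class c (at least one is kept)."""
    ranked = sorted(range(len(label_images)),
                    key=lambda k: image_score(c, label_images[k], psi),
                    reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Three tiny 2x2 ground-truth images with class indices 0, 1, 2.
images = [[[0, 0], [1, 1]], [[0, 0], [0, 0]], [[2, 2], [2, 1]]]
psi = cooccurrence(images, 3)
best_for_class_0 = top_images(0, images, psi, fraction=0.5)
```

The default `fraction=0.07` mirrors the top-7% cut-off mentioned above; how the per-class rankings are merged into one training set per cluster is not shown here.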

Training Sub-STFs and the Cluster Recognizer From
those decorrelated image training sets, we train separate
standard STFs and the cluster recognizer. The cluster recognizer
is a very fast linear SVM, trained on off-the-shelf
CNN feature vectors 0. At test time, the image is fed to
the trained cluster recognizer, which redirects the image to
an appropriate STF.
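
The test-time routing can be pictured with the following minimal sketch. It treats the cluster recognizer and each specialist STF as plain callables; `segment_with_dstf` and the toy stand-ins below are hypothetical names used only for illustration, not part of any released code.

```python
def segment_with_dstf(image, image_feature, cluster_recognizer, specialist_stfs):
    """Route a whole image to one specialist STF, then label every pixel with it.

    cluster_recognizer maps an image-level feature to a cluster id;
    specialist_stfs maps a cluster id to a per-pixel classifier.
    """
    cluster_id = cluster_recognizer(image_feature)  # one image-level decision
    stf = specialist_stfs[cluster_id]               # one STF per scene category
    return [[stf(pixel) for pixel in row] for row in image]

# Toy stand-ins: a threshold "recognizer" and two constant "STFs".
recognizer = lambda feature: 0 if feature[0] < 0.5 else 1
stfs = {0: lambda pixel: "grass", 1: lambda pixel: "road"}
labels = segment_with_dstf([[1, 2], [3, 4]], [0.9], recognizer, stfs)
```

The point of the design is that the scene decision is made once per image, so the per-pixel cost stays that of a single STF.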

A per-class comparison on the MSRC-21 dataset between
normal STF and our DSTF is demonstrated in Figure 2.
From the figure, we can see that DSTF improves
the segmentation accuracies for almost every class. Further
analysis is deferred until Section 3.2.1.

[Bar chart "MSRC 21-class dataset: per class comparison"; series: STF, Decorrelated STF.]

Figure 2. MSRC-21: Per-class comparisons of the DSTF results
and STF results, measuring Recall rates.

Efficiency Analysis We compare the efficiency of our proposed
DSTF against BoST [22] because BoST is known to
be a real-time semantic segmentation system that can be run
on a standard 2008 PC. For simplicity, we count one inference
computation of a decision tree or an SVM as one operation.
Our goal is to compare the number of operations
at test time. For [22], each pixel is routed through another
randomized decision forest. At each node, the BoST for a
related region of size $R$ pixels is constructed by inferring
the STF for every pixel in the region. Hence, the number of
operations required for predicting one pixel is $O(lR)$, where
$l$ is the number of levels of the randomized decision forest.
In contrast, the DSTF requires inference by one scene
classifier, a linear SVM, to route the pixel to an appropriate
STF, then inference by an STF for that pixel. Therefore, at
test time, the DSTF spends merely two operations for one
pixel prediction.
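
To make the operation count concrete, the toy calculation below plugs in illustrative values. The numbers chosen for the forest depth and region size are hypothetical and are not measurements from either system.

```python
def bost_ops_per_pixel(levels, region_size):
    """O(l * R): one STF inference per region pixel at each forest level."""
    return levels * region_size

def dstf_ops_per_pixel():
    """One scene-classifier inference plus one STF inference."""
    return 2

# Hypothetical setting: a 10-level forest and 100-pixel regions.
ratio = bost_ops_per_pixel(10, 100) / dstf_ops_per_pixel()
```

Under these assumed values the BoST-style count is 1000 operations per pixel against 2 for the DSTF.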

3.2. Global Appearance and Objects Co-Occurrence

Experiments from previous work El H Ea El confirm
that doing global or local detection concurrently with segmentation
can give significantly better segmentation results.
In addition, HIED incorporate the co-occurrence statistic
as context information into their detection and/or segmentation
system, and showed that such context can improve
accuracy. However, to the best of our knowledge, there is
no work that trains using the two cues of a) the presence
of the object in the image and b) such context information,
together, to improve the segmentation process.

Our second contribution is to propose the Context Sensitive
Image Level Prior (Context Sensitive ILP or ILPcont).
Please note that the terms context and co-occurrence will
be used interchangeably from now on. From the experiment,
Table 1 shows that generating the Image Level Prior (ILP)
with an algorithm that is aware of the co-occurrence of
classes produces much more promising results than
generating it with an algorithm trained to detect each class
separately. The details of the Context Sensitive ILP are
discussed next.

3.2.1 Context Sensitive Image Level Prior 

Although the Image Level Prior, or image level class detection,
has been proved to be useful for semantic segmentation in
recent papers, e.g. El [22], previously the ILP only modeled
the presence of classes for a single image. We propose
a new image level class detection that takes into account
co-occurrence statistics of the classes in the entire training
data. The use of multi-label randomized trees allows us to
model the global appearance of a single image and the
co-occurrence statistics of the classes of the entire training
data at the same time. Multi-label randomized trees were first
proposed in il, but were used in a different paradigm, namely
predicting a structured output. We, in contrast, use the
algorithm to learn both implicit class co-occurrence statistics
of the training data and presences of classes in an image.
Even though the algorithm is used to approach slightly different
problems, it can be directly applied without
any modification.

In this section, we will give a brief explanation of the
multi-label randomized trees algorithm 0. Multi-label random
forests are random forests with a minor modification
on the splitting quality metric. The metric is used to measure
how well a feature splits the data at the node. The
modification is made to take into account more than one
class for the node splitting instead of a single class in the
original randomized tree based model. The metric is based
on the Gini entropy, and the modifications are as follows.

$$\mathrm{score}(\mathcal{T}, \mathcal{F}) = \frac{1}{C} \sum_{k=1}^{C} \mathrm{score}^{k}(\mathcal{T}, \mathcal{F}), \tag{4}$$

$$\mathrm{score}^{k}(\mathcal{T}, \mathcal{F}) = G_k(\mathcal{T}) - G_k(\mathcal{T} \mid \mathcal{F}), \tag{5}$$

[Equation (6), which defines $G_k(\mathcal{T})$ in terms of $c_k(t_i)$ and $n$, is illegible in this copy.]

$$G_k(\mathcal{T} \mid \mathcal{F}) = G_k(\mathcal{T}_l) + G_k(\mathcal{T}_r), \tag{7}$$

where $\mathcal{T}$ is data of size $n$ that reaches the node and $\mathcal{F}$ denotes
a test function that routes a subset of the data $\mathcal{T}_l$ to its
left child node when all members of $\mathcal{T}_l$ satisfy $\mathcal{F}$, and otherwise
routes the data $\mathcal{T}_r$ to the right child node. $C$ is the number of
semantic classes. $c_k(t_i)$ is the function that returns 1 when
$L_{ik} = 1$, and 0 otherwise; and $L_{ik}$ indicates the presence of
class $k$ in the datapoint $t_i$.
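
A minimal sketch of a multi-label split score in the spirit of Equations (4) and (5) follows. Two pieces are assumptions rather than the definitions used above: the per-class impurity is taken to be the usual two-outcome Gini impurity $2p(1-p)$, and the two children are weighted by their share of the data, which is the common convention in Gini-based forests. The names `gini_k` and `split_score` are hypothetical.

```python
def gini_k(labels, k):
    """Assumed impurity for class k: 2 p (1 - p), p = share of samples with L_ik = 1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for lab in labels if lab[k] == 1) / n
    return 2.0 * p * (1.0 - p)

def split_score(labels, goes_left, num_classes):
    """Average over classes of (impurity before split - impurity after split).

    labels: list of binary presence vectors, one per datapoint.
    goes_left: list of booleans, the outcome of the test function per datapoint.
    """
    left = [lab for lab, g in zip(labels, goes_left) if g]
    right = [lab for lab, g in zip(labels, goes_left) if not g]
    n = len(labels)
    total = 0.0
    for k in range(num_classes):
        # Assumption: children are weighted by their share of the data.
        after = (len(left) / n) * gini_k(left, k) + (len(right) / n) * gini_k(right, k)
        total += gini_k(labels, k) - after
    return total / num_classes

# Four images: two contain only class 0, two contain only class 1.
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
perfect = split_score(labels, [True, True, False, False], 2)
useless = split_score(labels, [True, False, True, False], 2)
```

Because every class contributes to the score, a split is rewarded for separating groups of images whose label sets differ, which is how co-occurrence patterns can be captured implicitly.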

Table 1 compares results of the original Semantic Texton
Forests with our 2 proposed components. Please note
that combining the original STF with the context sensitive ILP
outperforms the full model of BoST proposed in [22]:
our system has average recall 68.03% compared to 66.9%
for BoST.


Method            no ILP    normal ILP    context ILP
STF (average)     31.02%    35.29%*       68.03%
DSTF (average)    39.67%    41.56%        70.07%†
STF (global)      44.00%    50.16%*       72.21%
DSTF (global)     49.82%    55.52%        73.97%†

Table 1. MSRC-21: Average and Global recalls of Decorrelated
Semantic Texton Forests and Context Sensitive ILP (†) and original
Semantic Texton Forests and ILP (*).


3.3. Location Potentials 

The last crucial ingredient is the location potentials.
The location potentials are simply the statistics of how
likely each absolute location in the image is to be occupied
by particular classes. The location potential is also used in
El.


In this work, training images are first split into 2
groups: portrait images and landscape images. Next, for
each group, we count how frequently each absolute location
is occupied by a particular class. After computing the
location potentials for each class (each class has 2 location
potentials: portrait and landscape), the location potentials
are used as look up tables for an input location.
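
The look-up tables described above could be built roughly as follows. This sketch assumes that all images within an orientation group share the same size, that square images count as landscape, and that frequencies are normalised by the number of images in the group; none of these choices is prescribed by the text, and `location_potentials` is a hypothetical name.

```python
def location_potentials(label_images, num_classes):
    """Per-orientation, per-class frequency of each absolute pixel location."""
    counts, totals = {}, {}
    for img in label_images:
        h, w = len(img), len(img[0])
        key = "portrait" if h > w else "landscape"
        if key not in counts:
            counts[key] = [[[0] * w for _ in range(h)] for _ in range(num_classes)]
            totals[key] = 0
        totals[key] += 1
        for r in range(h):
            for c in range(w):
                counts[key][img[r][c]][r][c] += 1
    # Normalise counts by the number of images in each orientation group.
    return {key: [[[v / totals[key] for v in row] for row in table]
                  for table in tables]
            for key, tables in counts.items()}

# Two tiny 1x3 "landscape" ground-truth images with classes 0 and 1.
pots = location_potentials([[[0, 0, 1]], [[0, 1, 1]]], 2)
lookup = pots["landscape"][1][0][2]  # class 1 at row 0, column 2
```

At test time the potential for a pixel is then a single table access indexed by orientation, class, row and column.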

Figure 3 illustrates the importance of the location potentials
compared to the DSTF.

Figure 3. MSRC-21: Influence of location potentials, ranging from
using pure DSTF, $\omega = 0$, to pure location potentials, $\omega = 1$, in
the model. Note that we optimise the system on average per-class
recall.


3.4. Integration of the Components 

A simple Conditional Random Field model is selected
to assemble the components. Simple, in our case, refers
to utilising an ordinary grid graph, where each pixel has
4-connected neighbors, and the Potts pairwise potential. This
potential assigns low energy to two adjacent pixels with the
same class, and high energy otherwise.

More formally, we cast the inference of our system as a
minimization of the energy function

$$E(x_1, \ldots, x_N \mid I) = \sum_{i \in N} \phi(x_i) + \sum_{(i,j) \in P} \psi(x_i, x_j), \tag{8}$$

where $x_i$ is a random variable associated with the test image
$I$. The random variable $x_i$ can be assigned one of the labels
in the label set $c = \{c_1, \ldots, c_C\}$. $N$ is the number of pixels
in the test image $I = \{x_1, \ldots, x_N\}$, and $P$ is a set of all pairs
of neighboring pixels. $\psi$ is defined as the Potts potential and
$\phi$ is the unary potential defined as

$$\phi(x_i) = (1 - \omega) \cdot \mathrm{DSTF}(x_i) + \omega \cdot \mathrm{Location}(x_i), \tag{9}$$

and (j/ is the image level prior of image $I$. Minimization is
carried out using the graph cuts algorithm of Boykov et al. 0.
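
To illustrate what is being minimised, the sketch below evaluates the energy of a given labeling on a 4-connected grid with a Potts pairwise term and the weighted unary term of Equation (9). It is an evaluation routine only, not a graph cuts solver. Treating the DSTF and location terms as per-label costs, the constant `potts_penalty`, and the omission of the image level prior are assumptions of this example.

```python
def unary(dstf_cost, loc_cost, w):
    """Eq. (9) for one pixel and one label: blend appearance and location costs."""
    return (1.0 - w) * dstf_cost + w * loc_cost

def energy(labels, dstf_costs, loc_costs, w, potts_penalty=1.0):
    """Unary plus Potts energy of a labeling on a 4-connected grid.

    labels[r][c] is a class index; dstf_costs[r][c][k] and loc_costs[r][c][k]
    are hypothetical per-label costs for the pixel at (r, c).
    """
    height, width = len(labels), len(labels[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            k = labels[r][c]
            total += unary(dstf_costs[r][c][k], loc_costs[r][c][k], w)
            # Count each neighbouring pair once: right and down neighbours.
            if c + 1 < width and labels[r][c + 1] != k:
                total += potts_penalty
            if r + 1 < height and labels[r + 1][c] != k:
                total += potts_penalty
    return total

# A 1x2 image with two labels; the second pixel prefers label 1.
dstf = [[[0.0, 1.0], [1.0, 0.0]]]
loc = [[[0.5, 0.5], [0.5, 0.5]]]
mixed = energy([[0, 1]], dstf, loc, w=0.5)
uniform = energy([[0, 0]], dstf, loc, w=0.5)
```

In this toy case the uniform labeling has lower energy because the Potts penalty outweighs the appearance preference of the second pixel, which is the smoothing effect the pairwise term is meant to provide.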

Figure 6 compares the average per-class recall results of
our whole system, and when certain components are missing.
The blue dotted line shows our results when all components
were trained with standard data splitting, as per EH.
The red dotted line shows average per-class recalls of the
system when only the ILP was trained with additional data,
sampled from the test data, so the test data size is smaller
than the standard one. The yellow dotted line shows the result
when the ILP was trained in unusual ways: the black
star is the result when the ILP was trained on the entire dataset,
with no unseen data for the ILP, and the magenta star represents
the result of the Ideal ILP, assuming that we know the
actual image tags.

4. Results by Dataset 

We evaluate our approach via two well-known semantic
segmentation datasets, MSRC-21 and PascalVOC-2010.
MSRC-21 is now considered a small older dataset, but it is
commonly used for validating semantic segmentation algorithms.
The latter, PascalVOC-2010, is one of the
newest and largest datasets. We choose these datasets to
show that our approach is robust with limited training data,
as well as with great diversity of scenes. The main reason
we select PascalVOC-2010 over the newer or the older versions
of the same competition is the recently published finer
ground truth associated with it [20].

4.1. MSRC-21 [23]

The MSRC-21 dataset is composed of 591 images of size
320 × 213 and 213 × 320. The segmentation ground truth
is made up of 21 classes which are mixed between background
classes and object classes. The parameters we are
using for the MSRC-21 dataset are the same as in [22], with
one additional parameter $\omega$ to weight between the appearance
potential (DSTF) and the location potential. We tune
the extra parameter using the validation set as shown in
Figure 3.

Table 1 demonstrates that both the proposed DSTF and
context sensitive ILP work to complement each other, and
give ≈9% and ^,2,1% improvement respectively. Figure 6
shows that our full model (integrating DSTF, Context Sensitive
ILP, and Location potential by simple CRF) can achieve
a result that is comparable to state of the art results. Besides,
when the ILP has more training data, our proposed model
even beats the top algorithm of this dataset. Table 4 shows
detailed results for each class compared to state of the art
algorithms. One can observe that our system performs well
on all classes. Qualitative results can be found in Fig. 4.

4.2. PascalVOC-2010 ISI 

The PascalVOC-2010 semantic segmentation dataset
consists of 964 training images, 964 validation images, and
964 test images. The dataset has 20 object classes and 1
background class which includes everything but the 20 object
classes. The background class occupies 60.1% of all
pixels in the training and validation set [20]. As the ground
truth for the test set is not publicly available, the organizers
run an evaluation server where users can submit up to
two submissions per week. We test our algorithm on this
dataset using default parameters, i.e. the same parameters
used for the MSRC-21 dataset. In addition, [20] relabeled
the ground truth by adding more classes and removing the
background class. We also show the results of training our
ILP with this new ground truth data on the standard
20-object-classes, and with additional context classes, such as
water, sky, road, etc. (+Add_context).

Figure 4. Qualitative results on the MSRC dataset.

Table 2 illustrates our quantitative results on the test data.
Alone, STF and DSTF do not perform well with this dataset,
since the data is more diverse and has a very large and complex
background class. However, the Context Sensitive ILP
still produces impressive improvements, improving the result
by ≈15% over the pure STF and DSTF, compared to
an improvement of only ≈5% by using the multi-class ILP.
Interestingly, our Context Sensitive ILP coupled with just
Location potentials is proving very powerful, despite missing
out on substantial information available to the full system.

Although the new ground truth on VOC is of better
quality, we can see that the accuracy decreases. The
relabeling process has modified the ground truth for the
standard 20-object-classes, but we still evaluate the result
via the evaluation server, which evaluates based on the old
ground truth. Furthermore, adding more context classes can
hurt the accuracy of the system because a larger number of
classes reduces the ILP prediction accuracy. Fig. 5 demonstrates
some qualitative results.

Figure 5. Qualitative results on the PascalVOC-2010 dataset when
using our approach. We illustrate the results overlaid on the
image since the ground truth for the test set is not available.


Methods                                      IoU
Train on original ground truth [9]
  STF                                        1.454
  DSTF                                       0.858
  STF+ILPmult                                6.019†
  STF+ILPcont                                16.656†
  DSTF+ILPcont                               16.947†
  Location+ILPcont                           21.474*
  STF+Loc+ILPmult                            7.060†
  STF+Loc+ILPcont                            21.403‡
  Our full model (DSTF+Loc+ILPcont)          21.588‡
  Our full model (DSTF+Loc+ILPcont)          24.058*
Train ILP on the new ground truth [20]
  STF+Loc+ILPmult                            6.585†
  STF+Loc+ILPcont                            17.760†
  STF+Loc+ILPcont+Add_context                16.875†
State of the art algorithms
  Topic model ITITI                          27.8
  DenseCRF [15] (non-standard test set)      30.2
  HCRF+Cooc ini                              30.3
  Whole [24] (non-standard test set)         31.2
  Harmony 4 [2]                              38.0
  Composite [19]                             49.6

Table 2. Intersection over Union score of the system on the
PascalVOC-2010 dataset. We demonstrate the results from different
combinations of our proposed components. Please note that
ILPmult and ILPcont represent the multi-class image level prior
and context sensitive image level prior respectively. Loc stands
for location potential, and Add_context represents us training the
ILP with extra context classes: sky, road, building, water, grass. ‡
indicates that the methods use the same set of parameters, to make
the numbers comparable; * indicates the parameter was tuned by
cross validating, fixing the ILP for class background to probability
0.1, 0.2, ..., 1.0.

5. Limitations 

We validated our proposed system on another standard
dataset, which demonstrates a predictable limitation of our
approach. The CamVid dataset ID consists of image sequences
of road scenes, where the ground truth labels associate
each pixel with one of the grouped 11 semantic
classes. To be comparable to other algorithms, we downsample
all the images by a factor of 3 as in mi.

Table 3 shows the quantitative results of our algorithm on
the CamVid dataset. Since most of the images have almost
the same set of classes present in them, context information
is not useful here. Therefore, we can see that the Context
Sensitive ILP performed worse than the normal multi-class
ILP because the context sensitive ILP can extract very few
co-occurrence patterns from the training data. DSTF also
hurts accuracy because the trained sub-STFs are not really
specialized, i.e. they have access to artificially small training
sets, with little difference between scene categories.


Methods                              Average    Global
STF                                  29.95      27.25
STF+ILPcont                          27.31      27.20
STF+ILPmult                          29.53      29.38
DSTF                                 10.41      7.38
DSTF+ILPmult                         26.84      28.32
Loc+ILPmult                          27.85      55.96
STF+Loc+ILPmult                      40.27      59.39
Full model (DSTF+Loc+ILPmult)        34.56      52.23
State of the art algorithm
Combining Object Detection ifTSl     62.5       83.8

Table 3. Average and Global recalls of the system on the CamVid
dataset. We tested our proposed model on different combinations
of the components.


6. Conclusion 

We have shown that a combination of simple techniques
can yield excellent accuracy, given only a limited computational
budget. Our DSTF shows an impressive ability to
empower the inaccurate appearance predictions of a normal
STF, with only a small extra overhead. This is noteworthy
because each sub-STF is working with less training data.
The Context Sensitive ILP proved quite capable of recovering
from even fairly bad appearance predictions. While
other ILP models have been proposed previously, using the
co-occurrence statistics jointly with image level class detection
can now be accomplished cheaply, and can yield a
substantial improvement in accuracy.

We are making our code publicly available. Many extensions
for the future are possible because the existing system
is simple and complementary to many other approaches. A
natural extension would use fast filters over the image as extra
appearance channels in the DSTF. It could also be fruitful
to learn a variety of location potentials, i.e. for different
camera poses, e.g. from car-mounted or hand-held cameras.

Figure 6. Average per-class recall results of the system (best viewed in colors) with or without certain components. The results are compared
to the state of the art m , green dotted line. Important notations: ILPmult (normal multi-class ILP regardless of the co-occurrence statistic),
ILPcont (the proposed Context Sensitive ILP), ILPcont+N (the proposed Context Sensitive ILP when training with standard training and
validation data + N images sampled from standard test data and testing on unseen data, therefore test data - N sampled images), ILPcont_seen
(the context sensitive ILP trained on all data; thus no unseen data for the ILP), idealILP (when the ILP component produces 100%
correct predictions), and Loc (Location potentials).



Columns: (1) building, (2) tree, (3) road, (4) cow, (5) grass, (6) sky, (7) sheep, (8) aeroplane, (9) body, (10) face, (11) book,
(12) water, (13) car, (14) bicycle, (15) flower, (16) sign, (17) bird, (18) chair, (19) cat, (20) dog, (21) boat, then Average and Global.

Method                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  Average  Global
BoST [22]               49  79  78  97  88  78  97  82  66  87  93  54  74  72  74  36  24  51  75  35  18       67      72
Harmony 1 |l 11         60  77  76  91  78  88  68  87  56  73  95  76  77  93  97  73  57  81  81  46  46       75      77
HCRF+Cooc inl           82  88  93  73  95 100  88  83  65  88  85  92  87  88  96  96  27  37  49  80  20       77      87
DenseCRF [15]           75  91  90  84  99  95  82  82  80  89  98  71  90  94  95  77  48  61  78  48  22       78      86
Whole [24]              71  90  89  79  98  93  86  8S  68  90  97  86  84  94  98  76  53  71  83  55  17       79      86
Harmony 4 [2]           66  84  82  81  87  93  83  81  70  78  90  82  86  94  96  87  48  81  82  75  52       80      83
Large Scale [1]         73  90  90  85  99  95  82  86  87  91  96  74  88  91  96  83  54  79  81  60  18       81      87
Our                     51  79  85  92  96  81  90  68  59  93  95  84  76  92  98  75  64  71  86  47  34       77      80
Our+100                 5S  86  89  92  97  87  93  67  61  90  94  87  75  92 100  88  70  83  77  91  70       83      85
Our+200                 71  86  88  94  98  86  95  79  68  83  99  86  76  96 100 100  91  94  80  85  51       86      88
Our+Seen                68  88  87  94  97  89  94  71  60  92  99  92  85  93  98  93  78  96  89  81  61       86      88
Our+Ideal               71  88  87  94  96  90  95  70  59  91  99  91  83  89  98  92  79  97  90  81  71       86      88
Harmony 4+Ideal [2]     68  92  89  86  93  97  88  91  60  73 100  85  86  94 100  89  77  96  95  94  74       87      89

Table 4. MSRC-21 segmentation results. Note that we show results of our system with Seen ILP and Ideal ILP to show the upper
bound of the system; we therefore do not include them in the comparison.